Objective: The purpose of this notebook is to analyse the gender distribution of the Federal University of Rio Grande do Norte's (UFRN) employees. The data was scraped from the federal government transparency portal, which does not provide gender information directly.
Participants:
Detailed explanation: In this notebook we analyse the gender of UFRN's employees based on their names. The list of employees can be requested from the Government Transparency Portal - http://www.portaldatransparencia.gov.br/. The portal lists all government bodies, including the employees related to each one. To be precise, for the part that concerns UFRN the requested URL could be: LINK
In this scenario we have to request and process all 413 pages containing the names of UFRN's employees and run an analysis based on those names. This method is called web scraping, and we can use the BeautifulSoup library to do the scraping and then run the gender analysis.
Topics
According to Wikipedia (https://en.wikipedia.org/wiki/Web_scraping), web scraping is a technique in which a computer program acquires data from human-readable documents on the internet. One of the most famous libraries in the Python stack is Beautiful Soup - https://www.crummy.com/software/BeautifulSoup/. You can easily install it by executing the command:
!pip install beautifulsoup4
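As a minimal sketch of the idea (the page and table layout below are illustrative assumptions, not the real portal markup), fetching a page and extracting the text of its table rows looks like this:
In [ ]:
# Minimal scraping sketch: download a page and print the cell text of every table row.
import requests
from bs4 import BeautifulSoup

html = requests.get("http://www.portaldatransparencia.gov.br/").content
soup = BeautifulSoup(html, "html.parser")

for table in soup.find_all("table"):
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            print(cells)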
Importing libraries needed
In [253]:
#Loading libraries needed
#System libraries
import os
import sys
import datetime
from datetime import date
##GeoJson data and services returned info
import json
import re
import requests
import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
#Basic libraries for data analysis
import numpy as np
from numpy import random
import pandas as pd
# Loading visualization libraries
#Jupyter Magic word to inline matplotlib plots
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
#Inline bokeh charts
from bokeh.io import push_notebook, show, output_notebook, output_file
output_notebook()
from bokeh.layouts import row
from bokeh.plotting import figure
from bokeh.sampledata.commits import data
from bokeh.models import (
GMapPlot, GMapOptions, ColumnDataSource, Circle, DataRange1d, PanTool, WheelZoomTool, BoxSelectTool, Jitter
)
from bokeh.core.properties import field
#Choropleth necessary libraries
##Necessary to create shapes in folium
from shapely.geometry import Polygon
from shapely.geometry import Point
##Choropleth itself
import folium
##Colormap
from branca.colormap import linear
Function to retrieve the level-two information
In [30]:
def retrieveinfo_level_two(url):
'''Retrieves data from http://www.portaldatransparencia.gov.br/servidores/OrgaoExercicio-DetalhaServidor.asp '''
employee_enrollment = 0
employee_responsability = 0
employee_responsability_class = 0
employee_responsability_pattern = 0
employee_responsability_reference = 0
employee_responsability_level = 0
employee_UF = 0
employee_UORG = 0
employee_legal_regime = 0
employee_activit = 0
employee_absence_from_work = 0
employee_work_time = 0
employee_last_job_responsability_modification_date = 0
employee_job_responsability_nominee_date = 0
employee_job_responsability_nominee_act = 0
employee_last_job_responsability_modification_date_body = 0
employee_entrance_public_service_legal_document = 0
employee_entrance_public_service_legal_document_number = 0
employee_entrance_public_service_legal_document_date = 0
responsability_description = 0
responsability_activity = 0
responsability_activity_parcial = 0
responsability_UF = 0
responsability_UORG = 0
responsability_last_data_change_resp = 0
responsability_sup_body = 0
try:
r = requests.post(url)
s = bs(r.content,"html")
#First part - retrieve the employee data
#print(len(s.find_all('table')))
if len(s.find_all('table')) <=4:
rows = s.find_all('table')[1].find_all('tr')
employee_enrollment = rows[2].findAll('td')[1].get_text(strip=True)
employee_responsability = rows[3].findAll('td')[1].get_text(strip=True)
employee_responsability_class = rows[4].findAll('td')[1].get_text(strip=True)
employee_responsability_pattern = rows[5].findAll('td')[1].get_text(strip=True)
employee_responsability_reference = rows[6].findAll('td')[1].get_text(strip=True)
employee_responsability_level = rows[7].findAll('td')[1].get_text(strip=True)
employee_UF = rows[9].findAll('td')[1].get_text(strip=True)
employee_UORG = rows[10].findAll('td')[1].get_text(strip=True)
employee_legal_regime = rows[17].findAll('td')[1].get_text(strip=True)
employee_activit = rows[18].findAll('td')[1].get_text(strip=True)
employee_absence_from_work = rows[19].findAll('td')[1].get_text(strip=True)
employee_work_time = rows[20].findAll('td')[1].get_text(strip=True)
employee_last_job_responsability_modification_date = rows[21].findAll('td')[1].get_text(strip=True)
employee_job_responsability_nominee_date = rows[22].findAll('td')[1].get_text(strip=True)
employee_job_responsability_nominee_act = rows[23].findAll('td')[1].get_text(strip=True)
employee_last_job_responsability_modification_date_body = rows[24].findAll('td')[1].get_text(strip=True)
employee_entrance_public_service_legal_document = rows[27].findAll('td')[1].get_text(strip=True)
employee_entrance_public_service_legal_document_number = rows[28].findAll('td')[1].get_text(strip=True)
employee_entrance_public_service_legal_document_date = rows[29].findAll('td')[1].get_text(strip=True)
else:
#Employee
rows = s.find_all('table')[3].find_all('tr')
employee_enrollment = rows[1].findAll('td')[1].get_text(strip=True)
employee_responsability = rows[2].findAll('td')[1].get_text(strip=True)
employee_responsability_class = rows[3].findAll('td')[1].get_text(strip=True)
employee_responsability_pattern = rows[4].findAll('td')[1].get_text(strip=True)
employee_responsability_reference = rows[5].findAll('td')[1].get_text(strip=True)
employee_responsability_level = rows[6].findAll('td')[1].get_text(strip=True)
employee_UF = ''#rows[9].findAll('td')[1].get_text(strip=True)
employee_UORG = rows[8].findAll('td')[1].get_text(strip=True)
employee_legal_regime = rows[16].findAll('td')[1].get_text(strip=True)
employee_activit = rows[17].findAll('td')[1].get_text(strip=True)
employee_absence_from_work = rows[18].findAll('td')[1].get_text(strip=True)
employee_work_time = rows[19].findAll('td')[1].get_text(strip=True)
employee_last_job_responsability_modification_date = rows[20].findAll('td')[1].get_text(strip=True)
employee_job_responsability_nominee_date = rows[21].findAll('td')[1].get_text(strip=True)
employee_job_responsability_nominee_act = rows[22].findAll('td')[1].get_text(strip=True)
employee_last_job_responsability_modification_date_body = rows[23].findAll('td')[1].get_text(strip=True)
employee_entrance_public_service_legal_document = rows[26].findAll('td')[1].get_text(strip=True)
employee_entrance_public_service_legal_document_number = rows[27].findAll('td')[1].get_text(strip=True)
employee_entrance_public_service_legal_document_date = rows[28].findAll('td')[1].get_text(strip=True)
#Responsibility data
rows = s.find_all('table')[2].find_all('tr')
responsability_description = rows[3].findAll('td')[1].get_text(strip=True)
responsability_activity = rows[4].findAll('td')[1].get_text(strip=True)
responsability_activity_parcial = rows[6].findAll('td')[1].get_text(strip=True)
responsability_UF = rows[7].findAll('td')[1].get_text(strip=True)
employee_UF = responsability_UF
responsability_UORG = rows[9].findAll('td')[1].get_text(strip=True)
responsability_sup_body = rows[10].findAll('td')[1].get_text(strip=True)
responsability_last_data_change_resp = rows[19].findAll('td')[1].get_text(strip=True)
#Second part - retrieve the link to the paycheck (level three) page
link = s.findAll("a", { "title" : "Remuneração individual do servidor" })
url_level_three = 'http://www.portaldatransparencia.gov.br' + link[0].get('href')
#print(url_level_three)
print(" L2 - OK")
return [employee_enrollment, employee_responsability, employee_responsability_class, employee_responsability_pattern, employee_responsability_reference, employee_responsability_level, employee_UF, employee_UORG, employee_legal_regime, employee_activit, employee_absence_from_work, employee_work_time, employee_last_job_responsability_modification_date, employee_job_responsability_nominee_date, employee_job_responsability_nominee_act, employee_last_job_responsability_modification_date_body, employee_entrance_public_service_legal_document, employee_entrance_public_service_legal_document_number , employee_entrance_public_service_legal_document_date, responsability_description, responsability_activity, responsability_activity_parcial, responsability_UF, responsability_UORG, responsability_sup_body, responsability_last_data_change_resp, url_level_three]
except:
pass
print(" L2 - NOINFO/FAIL URL: " + url)
return [0, '', 0, 0, 0, 0, '', '', '', '', '', '', '', '', '', '', '', 0, '', '', '', '', '', '', '', '', '']
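A quick usage sketch of the function above; the detail URL here is a made-up placeholder, since the real links are harvested from the listing pages in the main loop below:
In [ ]:
# Hypothetical detail URL, only to illustrate the call shape.
example_url = 'http://www.portaldatransparencia.gov.br/servidores/OrgaoExercicio-DetalhaServidor.asp?IdServidor=0000000'
level_two = retrieveinfo_level_two(example_url)
len(level_two)   # 27 fields; the last one is the link to the paycheck (level three) page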
Function to retrieve the level-three information
In [28]:
def retrieveinfo_level_three(url):
'''Retrieves data from http://www.portaldatransparencia.gov.br/servidores/Servidor-DetalhaRemuneracao.asp'''
emp_year = 0
emp_tot_paycheck = 0
emp_event_paycheck = 0
emp_13_paycheck = 0
emp_paid_vacation = 0
emp_other = 0
emp_irrf = 0
emp_rgps = 0
emp_paycheck_after_deduction = 0
emp_other_receivings = 0
emp_other_deduction = 0
month_dict = {'janeiro':1, 'fevereiro':2, 'março':3, 'abril':4, 'maio':5, 'junho':6, 'julho':7, 'agosto':8, 'setembro':9, 'outubro':10, 'novembro':11, 'dezembro':12}
try:
r = requests.post(url)
s = bs(r.content,"html")
rows = s.find_all('tbody')[1].find_all('tr')
counterPaycheck = 0
emp_month = 9
emp_year = 2017
if len(s.find_all('tbody')[1].find_all('tr'))<10:
print(" L3 OK - NO INFO")
return [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
else:
#Remuneração básica bruta (gross base remuneration)
if(len(rows[4].findAll('td')[2].get_text(strip=True))>0):
emp_tot_paycheck = re.sub('[\.]+','',rows[4].findAll('td')[2].get_text(strip=True))
emp_tot_paycheck = re.sub(',','.',emp_tot_paycheck)
else:
emp_tot_paycheck = 0.00
if(len(rows[7].findAll('td')[2].get_text(strip=True))>0):
emp_irrf = re.sub('[\.]+','',rows[7].findAll('td')[2].get_text(strip=True))
emp_irrf = re.sub(',','.',emp_irrf)
else:
emp_irrf = 0.00
if(len(rows[8].findAll('td')[2].get_text(strip=True))>0):
emp_rgps = re.sub('[\.]+','',rows[8].findAll('td')[2].get_text(strip=True))
emp_rgps = re.sub(',','.',emp_rgps)
else:
emp_rgps = 0.00
if len(s.find_all('tbody')[1].find_all('tr'))==21:
if(len(rows[11].findAll('td')[2].get_text(strip=True))>0):
emp_other_deduction = re.sub('[\.]+','',rows[11].findAll('td')[2].get_text(strip=True))
emp_other_deduction = re.sub(',','.',emp_other_deduction)
else:
emp_other_deduction = 0.00
if(len(rows[13].findAll('td')[1].get_text(strip=True))>0):
emp_paycheck_after_deduction = re.sub('[\.]+','',rows[13].findAll('td')[1].get_text(strip=True))
emp_paycheck_after_deduction = re.sub(',','.',emp_paycheck_after_deduction)
else:
emp_paycheck_after_deduction = 0.00
if(len(rows[18].findAll('td')[2].get_text(strip=True))>0):
emp_other_receivings = re.sub('[\.]+','',rows[18].findAll('td')[2].get_text(strip=True))
emp_other_receivings = re.sub(',','.',emp_other_receivings)
else:
emp_other_receivings = 0.00
else:
if(len(rows[10].findAll('td')[1].get_text(strip=True))>0):
emp_paycheck_after_deduction = re.sub('[\.]+','',rows[10].findAll('td')[1].get_text(strip=True))
emp_paycheck_after_deduction = re.sub(',','.',emp_paycheck_after_deduction)
else:
emp_paycheck_after_deduction = 0.00
if(len(rows[15].findAll('td')[2].get_text(strip=True))>0):
emp_other_receivings = re.sub('[\.]+','',rows[15].findAll('td')[2].get_text(strip=True))
emp_other_receivings = re.sub(',','.',emp_other_receivings)
else:
emp_other_receivings = 0.00
print(" L3 OK")
url=""
return [emp_month, emp_year, emp_tot_paycheck, emp_irrf, emp_rgps, emp_other_deduction, emp_paycheck_after_deduction, emp_other_receivings]
except:
url=""
print(" Level THREE - FAIL URL: " + url)
return [0, 0, 0, 0, 0, 0, 0, 0]
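The repeated `re.sub` pairs above convert Brazilian-formatted currency strings such as '12.345,67' into dotted decimals. A small helper doing the same conversion, shown only as a readability sketch (the function name is ours and is not used elsewhere in this notebook):
In [ ]:
def br_currency_to_float(text):
    '''Convert a Brazilian-formatted number such as '12.345,67' to a float (12345.67).'''
    if not text:
        return 0.0
    return float(text.replace('.', '').replace(',', '.'))

br_currency_to_float('12.345,67')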
Main part of the web scraping: iterating over UFRN's employees.
In [ ]:
#store data crawled
ufrnEmployeList = []
counter= 0
debug = False
#This method iterates through all the pages from the UFRN transparency page
try:
for x in range(1,413):
url = "http://www.portaldatransparencia.gov.br/servidores/OrgaoExercicio-ListaServidores.asp?CodOS=15000&DescOS=MINISTERIO%20DA%20EDUCACAO&CodOrg=26243&DescOrg=UNIVERSIDADE%20FED.%20DO%20RIO%20GRANDE%20DO%20NORTE&Pagina="+ str(x) +"+&TextoPesquisa="
print("L1 - " + str(x) +'/413 - ' + str((x/413)*100) + '% pages loaded...0')
r = requests.post(url)
s = bs(r.content,"html")
rows = s.find_all('table')[1].find_all('tr')
counter= 0
# Iterate through all lines 'tr'
for row in rows:
counter = counter + 1
if counter == 1:
continue
#Take all the columns
tdList = row.findAll('td')
#Employee's CPF
cpf = tdList[0].get_text(strip=True)
#Employee's detail link
hrefList = tdList[1].find_all('a')
empHref = hrefList[0].get('href')
#Employee's name
name = tdList[1].get_text(strip=True)
aux = [cpf, empHref, name]
print(" L2")
return_level_two = retrieveinfo_level_two("http://www.portaldatransparencia.gov.br/servidores/" + str(empHref))
print(" L3")
if(len(return_level_two[-1])>150):
return_level_three = retrieveinfo_level_three(return_level_two[-1])
else:
print(" L2 - No link to L3")
#Keep the row shape consistent when there is no level-three link
return_level_three = [0, 0, 0, 0, 0, 0, 0, 0]
aux = aux + return_level_two + return_level_three
ufrnEmployeList.append(aux)
if debug:
break
if debug:
break
except:
print("L1 - Issues on page " + str(x) + " Line: " + str(counter) + " URL: " + str(url))
pass
print("Number of employees aquired: " + str(len(ufrnEmployeList)))
for em in ufrnEmployeList:
print(em)
In [41]:
# Transforming the list into a Pandas DataFrame
dfUfrnComplete = pd.DataFrame.from_records(ufrnEmployeList, columns=['cpf','hrefLevel2','name', 'emp_enrollment', 'emp_responsability', 'emp_responsability_class', 'emp_responsability_pattern', 'emp_responsability_reference', 'emp_responsability_level', 'emp_UF', 'emp_UORG', 'emp_legal_regime', 'emp_activit', 'emp_absence_from_work', 'emp_work_time', 'emp_last_job_responsability_modification_date', 'emp_job_responsability_nominee_date', 'emp_job_responsability_nominee_act', 'emp_last_job_responsability_modification_date_body', 'emp_entrance_public_service_legal_document', 'emp_entrance_public_service_legal_document_number', 'emp_entrance_public_service_legal_document_date', 'responsability_description', 'responsability_activity', 'responsability_activity_parcial', 'responsability_UF', 'responsability_UORG', 'responsability_sup_body', 'responsability_last_data_change_resp', 'url_level_three', 'emp_month', 'emp_year', 'emp_tot_paycheck', 'emp_irrf', 'emp_rgps', 'emp_other_deduction', 'emp_paycheck_after_deduction', 'emp_other_receivings', 'e1', 'e2', 'e3'])
The data acquired in the first listing is restricted to the name and part of the CPF number. With these two pieces of data we can analyse gender and region of origin, but now it is time to do more, and for that we need more related data. In the first step we already collected a link with detailed information about each UFRN employee, so there are two more levels of crawling, like this:
The first level is the one we have already crawled. The second level can be accessed through the link collected at the first level, which is already stored in the DataFrame; we iterate over the DataFrame and request each of these links. The link to the third level is found on the second-level page, and we request its information right after processing the level-two page, one employee at a time.
Let's go...
In [42]:
dfUfrnComplete.to_csv('ufrnEmployeeList_20112017.csv',sep=',')
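To avoid re-running the whole crawl in a later session, the saved CSV can be reloaded; a sketch (we force `cpf` to string so the masked digits are not interpreted as numbers):
In [ ]:
# Reload the crawled data in a later session instead of scraping again.
dfUfrnComplete = pd.read_csv('ufrnEmployeeList_20112017.csv', sep=',', index_col=0, dtype={'cpf': str})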
In [45]:
dfUfrnComplete['first_name'] = dfUfrnComplete['name'].str.split(' ').str[0]
dfUfrnComplete['last_name'] = dfUfrnComplete['name'].str.split(' ').str[-1]
In [46]:
dfUfrnComplete['first_name'].value_counts(sort=True)
Out[46]:
In [47]:
dfUfrnComplete.head(10)
Out[47]:
After finishing the scraping, we have acquired the following information:
In [190]:
dfUfrnComplete.columns
Out[190]:
To infer gender from the names we used the Namsor service, which exposes an endpoint of the form:
https://api.namsor.com/onomastics/api/json/gender/FIRST_NAME/LAST_NAME/COUNTRY
for example:
https://api.namsor.com/onomastics/api/json/gender/Marco/Oliveira/br
In [62]:
def request_gender_namsor(row):
#Example https://api.namsor.com/onomastics/api/json/gender/Marco/Oliveira/br
try:
url = 'https://api.namsor.com/onomastics/api/json/gender/'+row['first_name']+'/'+row['last_name']+'/br'
print(url)
response = urlopen(url)
decoded = response.read().decode('utf-8')
data = json.loads(decoded)
return data['gender']
except:
return ''
In [ ]:
dfUfrnComplete['gender_namsor'] = dfUfrnComplete.apply(request_gender_namsor, axis=1)
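Because many employees share a first name, a possible optimisation (a sketch we did not use for the numbers reported below, and which ignores surname variation) is to query Namsor once per distinct first name and map the result back:
In [ ]:
# Query the service once per distinct first name and map the result back to every row.
unique_first = dfUfrnComplete[['first_name', 'last_name']].drop_duplicates('first_name')
gender_by_first = {row['first_name']: request_gender_namsor(row) for _, row in unique_first.iterrows()}
# Uncomment to use the cached lookup instead of the row-by-row apply above:
#dfUfrnComplete['gender_namsor'] = dfUfrnComplete['first_name'].map(gender_by_first)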
In [65]:
dfUfrnComplete['gender_namsor'].value_counts(sort=True)
Out[65]:
As we can see above, the Namsor service categorized 5920 names as male or female; the other 260 names were not categorized, or were categorized as 'unknown'.
In [192]:
dfUfrnComplete['gender_namsor'].value_counts(sort=True).plot.bar()
Out[192]:
In [78]:
dfUfrnComplete[(dfUfrnComplete['gender_namsor'] =='')]['first_name'].value_counts(sort=True)
Out[78]:
We will manually adjust the names left as unknown by the Namsor service. For this, we will consider all first names ending in 'O', 'OS', 'ON', 'U', or 'US' as male names. Names with other endings, such as 'E' or 'I', do not follow a consistent pattern the way male names ending in 'O' and 'U' do. However, there is a list of known female names ending in 'O':
In [111]:
dfUfrnComplete[dfUfrnComplete['first_name'].str.contains("AIKO|AMPARO|ANUNCIAÇÃO|ASSUNÇÃO|CALIPSO|CARMO|CARMINHO|CLÉO|CHARO|CLIO|CONCEIÇÃO|CONSOLAÇÃO|CONSUELO|DIDO|ERATO|ÍNDIGO|INO|IO|IZARO|JUNO|KEIKO|LERATO|LETO|LILO|LUCERO|MARGÔ|MIRTO|PURIFICAÇÃO|ROCÍO|ROSÁRIO|ROSARINHO|SOCORRO|TAMIKO|TARIRO|TEMISTO|YOKO")]
Out[111]:
As we can see, none of the names from that list appears in our data, so we can generalize and say that the names matching the rule described above are male names.
In [112]:
dfUfrnComplete['gender_namsor_adjusted'] = dfUfrnComplete['gender_namsor']
In [155]:
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-1] == 'O') & (dfUfrnComplete['gender_namsor'] =='unknown'),'gender_namsor_adjusted']#= 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'ON') & (dfUfrnComplete['gender_namsor'] =='unknown'),'gender_namsor_adjusted'] #= 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-1] == 'U') & (dfUfrnComplete['gender_namsor'] =='unknown'),'gender_namsor_adjusted'] #= 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'US') & (dfUfrnComplete['gender_namsor'] =='unknown'),'gender_namsor_adjusted'] #= 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'OS') & (dfUfrnComplete['gender_namsor'] =='unknown'),'gender_namsor_adjusted'] = 'male'
In [159]:
#dfUfrnComplete[(dfUfrnComplete['first_name'].str[-1] == 'O') & (dfUfrnComplete['gender_namsor'] =='unknown')]
#dfUfrnComplete[(dfUfrnComplete['first_name'].str[-2:] == 'ON') & (dfUfrnComplete['gender_namsor'] =='unknown')]
#dfUfrnComplete[(dfUfrnComplete['first_name'].str[-1] == 'U') & (dfUfrnComplete['gender_namsor'] =='unknown')]
#dfUfrnComplete[(dfUfrnComplete['first_name'].str[-2:] == 'US') & (dfUfrnComplete['gender_namsor'] =='unknown')]
#dfUfrnComplete[(dfUfrnComplete['first_name'].str[-2:] == 'OS') & (dfUfrnComplete['gender_namsor'] =='unknown')]
dfUfrnComplete[(dfUfrnComplete['gender_namsor_adjusted'] =='unknown')]['first_name']
Out[159]:
In [168]:
dfUfrnComplete.loc[dfUfrnComplete['first_name'].str.contains("ACYNELLY|AISLANIA|ALCIONE|ALMARIA|AMALUSIA|ANEIDE|ANELLYSA|AURIGENA|AUZELIVIA|BARNORA|CLAUDIANNY|CRISLUCI|CRISTHIANNE|DERISCLEIA|EDILZA|EDZANA|SANZIA|SAONARA|SEMELY|SHEYLENA|SISLLEY|SONAYDY|SORANEIDE|SUELENE|SUENI|SUENIA|SULEMI|SUZERICA|TAIZA|THAISE|THAIZA|THATYANE|VALDECY|VALDENIA|WALANNE|WALDENICE|WANDERLEIA|WANUSIA|WICLIFFE|YULYANNA|ZORAIDE"),'gender_namsor_adjusted'] = 'female'
In [161]:
dfUfrnComplete[(dfUfrnComplete['gender_namsor_adjusted'] =='unknown')]['first_name']
Out[161]:
Adjusting the names whose gender retrieval failed
In [474]:
dfUfrnComplete.loc[(dfUfrnComplete['gender_namsor'] == ''),'gender_namsor_adjusted'] = 'unknown'
Out[474]:
In [161]:
dfUfrnComplete[(dfUfrnComplete['gender_namsor_adjusted'] =='unknown')]['first_name']
Out[161]:
In [48]:
from pygenderbr import Gender
gapi = Gender()
For consistency across tools, we map the returned values: 'M' to 'male', 'F' to 'female', and '' to 'unknown'.
In [338]:
def request_gender_pygenderbr(row):
nome = row['first_name']
gender_api = gapi.getgender(nome)
gender_return = ''
if gender_api[0] == 'M':
gender_return = 'male'
elif gender_api[0] == 'F':
gender_return = 'female'
elif gender_api[0].strip() == '':
gender_return = 'unknown'
else:
gender_return = gender_api[0]
print("Nome: " + nome + " Gender: " + gender_return + " | ", end="")
return gender_return
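A quick sanity check of the wrapper on a single, example name (assuming the pygenderbr call behaves as in the function above):
In [ ]:
# Single-name sanity check; 'MARIA' is only an illustrative example.
request_gender_pygenderbr({'first_name': 'MARIA'})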
In [358]:
print("Stating to request Py GenderBR")
start_time = datetime.datetime.now()
print(start_time)
dfUfrnComplete['gender_pygenderbr'] = dfUfrnComplete.apply(request_gender_pygenderbr, axis=1)
print("Finished to request Py GenderBR")
finish_time = datetime.datetime.now()
print(finish_time)
In [368]:
print("Total time to complete gender requests: ")
print(finish_time-start_time)
In [359]:
dfUfrnComplete['gender_pygenderbr'].value_counts()
Out[359]:
In [372]:
dfUfrnComplete['gender_pygenderbr_adjusted'] = dfUfrnComplete['gender_pygenderbr']
Adjusting male names
In [394]:
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-1] == 'O') & (dfUfrnComplete['gender_pygenderbr_adjusted'] =='unknown'),'gender_pygenderbr_adjusted'] = 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'ON') & (dfUfrnComplete['gender_pygenderbr_adjusted'] =='unknown'),'gender_pygenderbr_adjusted'] = 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-1] == 'U') & (dfUfrnComplete['gender_pygenderbr_adjusted'] =='unknown'),'gender_pygenderbr_adjusted'] = 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'US') & (dfUfrnComplete['gender_pygenderbr_adjusted'] =='unknown'),'gender_pygenderbr_adjusted'] = 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'OS') & (dfUfrnComplete['gender_pygenderbr_adjusted'] =='unknown'),'gender_pygenderbr_adjusted'] = 'male'
Adjusting female names
In [408]:
dfUfrnComplete.loc[dfUfrnComplete['first_name'].str.contains("ACYNELLY|AISLANIA|ALCIONE|ALMARIA|AMALUSIA|ANEIDE|ANELLYSA|AURIGENA|AUZELIVIA|BARNORA|CLAUDIANNY|CRISLUCI|CRISTHIANNE|DERISCLEIA|EDILZA|EDZANA|SANZIA|SAONARA|SEMELY|SHEYLENA|SISLLEY|SONAYDY|SORANEIDE|SUELENE|SUENI|SUENIA|SULEMI|SUZERICA|TAIZA|THAISE|THAIZA|THATYANE|VALDECY|VALDENIA|WALANNE|WALDENICE|WANDERLEIA|WANUSIA|WICLIFFE|YULYANNA|ZORAIDE"),'gender_pygenderbr_adjusted'] = 'female'
In [406]:
dfUfrnComplete[dfUfrnComplete['gender_pygenderbr']=='unknown']['first_name']
Out[406]:
In [409]:
dfUfrnComplete['gender_pygenderbr_adjusted'].value_counts()
Out[409]:
Another library tested is gender-guesser. It is a fork of the no-longer-maintained SexMachine library, as described at https://pypi.python.org/pypi/gender-guesser/#downloads. You can install it with:
!pip install gender-guesser
In [169]:
import gender_guesser.detector as gender
In [184]:
def requestgender_genderguesser(row):
try:
d = gender.Detector()
name = row['first_name']
return d.get_gender(name.title())
except:
return 'unknown'
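Creating a new `Detector` inside the function reloads the name database for every row. A slightly faster, equivalent sketch instantiates it once and reuses it:
In [ ]:
# Instantiate the detector once and reuse it for every row.
detector = gender.Detector()

def requestgender_genderguesser_fast(row):
    try:
        return detector.get_gender(row['first_name'].title())
    except:
        return 'unknown'

#dfUfrnComplete['gender_genderguesser'] = dfUfrnComplete.apply(requestgender_genderguesser_fast, axis=1)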
In [185]:
dfUfrnComplete['gender_genderguesser'] = dfUfrnComplete.apply(requestgender_genderguesser, axis=1)
In [187]:
dfUfrnComplete['gender_genderguesser'].value_counts(sort=True)
Out[187]:
The Gender Guesser API uses more classification values in its output, so we made an adjustment to make all three tools/services comparable: the Gender Guesser outputs 'andy', 'mostly_female', and 'mostly_male' are re-categorized as 'unknown'.
In [365]:
dfUfrnComplete['gender_genderguesser_adjusted'] = dfUfrnComplete['gender_genderguesser']
dfUfrnComplete.loc[(dfUfrnComplete['gender_genderguesser_adjusted'] == 'andy') | (dfUfrnComplete['gender_genderguesser_adjusted'] =='mostly_female') | (dfUfrnComplete['gender_genderguesser_adjusted'] =='mostly_male'),'gender_genderguesser_adjusted'] = 'unknown'
In [367]:
dfUfrnComplete['gender_genderguesser_adjusted'].value_counts(sort=True)
Out[367]:
Adjusting male names; only three names matching this rule were found.
In [411]:
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-1] == 'O') & (dfUfrnComplete['gender_genderguesser_adjusted'] =='unknown'),'gender_genderguesser_adjusted'] = 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'ON') & (dfUfrnComplete['gender_genderguesser_adjusted'] =='unknown'),'gender_genderguesser_adjusted'] = 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-1] == 'U') & (dfUfrnComplete['gender_genderguesser_adjusted'] =='unknown'),'gender_genderguesser_adjusted'] = 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'US') & (dfUfrnComplete['gender_genderguesser_adjusted'] =='unknown'),'gender_genderguesser_adjusted'] = 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'OS') & (dfUfrnComplete['gender_genderguesser_adjusted'] =='unknown'),'gender_genderguesser_adjusted'] = 'male'
Adjusting female names: 48 known female names were checked, but this time only one was found and re-labelled as female.
In [420]:
dfUfrnComplete.loc[dfUfrnComplete['first_name'].str.contains("ACYNELLY|AISLANIA|ALCIONE|ALMARIA|AMALUSIA|ANEIDE|ANELLYSA|AURIGENA|AUZELIVIA|BARNORA|CLAUDIANNY|CRISLUCI|CRISTHIANNE|DERISCLEIA|EDILZA|EDZANA|SANZIA|SAONARA|SEMELY|SHEYLENA|SISLLEY|SONAYDY|SORANEIDE|SUELENE|SUENI|SUENIA|SULEMI|SUZERICA|TAIZA|THAISE|THAIZA|THATYANE|VALDECY|VALDENIA|WALANNE|WALDENICE|WANDERLEIA|WANUSIA|WICLIFFE|YULYANNA|ZORAIDE"),'gender_genderguesser_adjusted'] = 'female'
In [422]:
dfUfrnComplete['gender_genderguesser'].value_counts(sort=True)
Out[422]:
In [423]:
dfUfrnComplete['gender_genderguesser_adjusted'].value_counts(sort=True)
Out[423]:
As we can see below, after the common adjustments applied to all tools (the male-name ending rule and the list of known female names that end like male names), we end up with the following results. Notice that Gender Guesser is the tool with the most unknown results, followed by GenderBR. It is important to remember that Namsor asks for the country of origin of the name, GenderBR is based on Brazil's census, and Gender Guesser aggregates data from several countries to try to guess the gender of a name.
In [525]:
result1 = dfUfrnComplete[['gender_namsor_adjusted','gender_pygenderbr_adjusted','gender_genderguesser_adjusted']]
result1 = result1.apply(lambda x: x.value_counts())
print(result1)
result1.plot.bar(figsize=(17,5))
Out[525]:
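To express the same comparison as percentages (the form used in the summary table at the end of this notebook), each column of `result1` can be normalised by its total:
In [ ]:
# Convert the per-tool counts into percentages of each tool's total.
result1_pct = (result1.div(result1.sum(axis=0), axis=1) * 100).round(2)
print(result1_pct)
result1_pct.plot.bar(figsize=(17,5))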
In the year 2000 the Lei de Responsabilidade Fiscal was enacted, a budget and fiscal law for the expenditures of Brazil's municipalities. One of the main points of this law concerns transparency, the so-called active transparency: the town hall, the legislative chamber, and the other bodies of government must publish their data on their transparency portals. After that came the Lei de Acesso à Informação, another law aimed at improving the transparency portals and access to information, now also covering passive transparency, which is when citizens request government information not available on the transparency portal. One of the points a transparency portal must obey is to mask all private data, and that is what http://www.portaldatransparencia.gov.br/ does with the employees' CPF numbers. But knowing the structure of the CPF, we can use the last digit before the two check digits to map the fiscal region where the person's CPF was registered. So we made a simple function to label where UFRN's employees come from; according to https://pt.wikipedia.org/wiki/Cadastro_de_pessoas_f%C3%ADsicas and https://janio.sarmento.org/curiosidade-identificacao-de-cpf-conforme-o-estado/ that digit maps to the regions below:
In [68]:
def label_region (row):
if row['cpf'][10:11] == '1':
return 'Distrito Federal, Goiás, Mato Grosso, Mato Grosso do Sul e Tocantins'
if row['cpf'][10:11] == '2':
return 'Amazonas, Pará, Roraima, Amapá, Acre e Rondônia'
if row['cpf'][10:11] == '3':
return 'Ceará, Maranhão e Piauí'
if row['cpf'][10:11] == '4':
return 'Paraíba, Pernambuco, Alagoas e Rio Grande do Norte'
if row['cpf'][10:11] == '5':
return 'Bahia e Sergipe'
if row['cpf'][10:11] == '6':
return 'Minas Gerais'
if row['cpf'][10:11] == '7':
return 'Rio de Janeiro e Espírito Santo'
if row['cpf'][10:11] == '8':
return 'São Paulo'
if row['cpf'][10:11] == '9':
return 'Paraná e Santa Catarina'
if row['cpf'][10:11] == '0':
return 'Rio Grande do Sul'
return ''
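The same mapping can be written more compactly with a dictionary; the sketch below is equivalent to `label_region` above (the helper name is ours):
In [ ]:
# Dictionary-based equivalent of label_region: the fiscal-region digit of the CPF selects the region.
CPF_FISCAL_REGIONS = {
    '1': 'Distrito Federal, Goiás, Mato Grosso, Mato Grosso do Sul e Tocantins',
    '2': 'Amazonas, Pará, Roraima, Amapá, Acre e Rondônia',
    '3': 'Ceará, Maranhão e Piauí',
    '4': 'Paraíba, Pernambuco, Alagoas e Rio Grande do Norte',
    '5': 'Bahia e Sergipe',
    '6': 'Minas Gerais',
    '7': 'Rio de Janeiro e Espírito Santo',
    '8': 'São Paulo',
    '9': 'Paraná e Santa Catarina',
    '0': 'Rio Grande do Sul',
}

def label_region_dict(row):
    return CPF_FISCAL_REGIONS.get(row['cpf'][10:11], '')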
In [69]:
# Apply label by CPF locale information
dfUfrnComplete['cpf_region'] = dfUfrnComplete.apply(label_region, axis=1)
In [70]:
ufrnOrigin = dfUfrnComplete['cpf_region'].value_counts(sort=True).reset_index()
ufrnOrigin.columns=['name', 'count']
ufrnOrigin
Out[70]:
In [72]:
# Import the GeoJSON file that describes the CPF fiscal regions
fiscal_region = os.path.join('geojson', 'mapCPF.geojson')
# load the data and use 'UTF-8'encoding
geo_json_fiscal = json.load(open(fiscal_region,encoding='UTF-8'))
In [73]:
fiscal = []
# list all fiscal regions
for neigh in geo_json_fiscal['features']:
fiscal.append(neigh['properties']['name'])
Out[73]:
In [74]:
colorscaleFiscalRegion = linear.OrRd.scale(ufrnOrigin['count'].min(), ufrnOrigin['count'].max())
threshold_scale = [ufrnOrigin['count'].min(), 80, 150, 200, ufrnOrigin['count'].max()]
In [75]:
# Create a map object
m = folium.Map(
location=[-14.150767, -51.057477],
zoom_start=4,
tiles='cartodbpositron'
)
#
m.choropleth(
geo_data=geo_json_fiscal,
data=ufrnOrigin,
columns=['name', 'count'],
key_on='feature.properties.name',
fill_color='OrRd',
legend_name='UFRN - Employee region of origin (CPF)',
highlight=True,
threshold_scale = threshold_scale,
line_color='red',
line_weight=0.2,
line_opacity=0.6
)
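The interactive map can also be saved as a standalone HTML file (the file name is arbitrary):
In [ ]:
# Optionally write the map to a standalone HTML file.
m.save('ufrn_cpf_regions.html')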
Below we have a choropleth according to the fiscal-region digit of the CPF (the third digit from the right, i.e. the last digit before the two check digits). As we can notice, the reddest region is the one from which most of UFRN's employees originate: the 4th fiscal region. After the 4th come the 8th (São Paulo) and 3rd (Ceará, Maranhão and Piauí) fiscal regions.
In [76]:
m
Out[76]:
In [198]:
dfUfrnComplete['emp_responsability'].value_counts(sort=True)
Out[198]:
In [213]:
result = dfUfrnComplete.groupby(['emp_responsability','gender_namsor_adjusted'])['emp_responsability'].count().unstack('gender_namsor_adjusted').fillna(0)
result
#dfUfrnComplete['emp_responsability'].value_counts(sort=True)
#filter = pessoas_tce.groupby(['setor','GENERO'])['setor'].count().unstack('GENERO').fillna(0)
#filter.sum(axis=1).sort_values().plot.barh(figsize=(17,30))
Out[213]:
In [216]:
result.sum(axis=1).sort_values().plot.barh(stacked=True,figsize=(17,30))
Out[216]:
In [220]:
result = dfUfrnComplete[(dfUfrnComplete['emp_responsability']=='PROFESSOR DO MAGISTERIO SUPERIOR')].groupby(['emp_UORG','gender_namsor_adjusted'])['emp_UORG'].count().unstack('gender_namsor_adjusted').fillna(0)
result
Out[220]:
In [221]:
result.sum(axis=1).sort_values().plot.barh(stacked=True,figsize=(17,30))
Out[221]:
In [235]:
result = dfUfrnComplete[(dfUfrnComplete['emp_responsability']=='PROFESSOR DO MAGISTERIO SUPERIOR')].groupby(['cpf_region','gender_namsor_adjusted'])['cpf_region'].count().unstack('gender_namsor_adjusted').fillna(0)
result
Out[235]:
Distribution according to place of origin (CPF fiscal region). We can see that only in the 'Rio Grande do Sul' and 'Amazonas, Pará, Roraima, Amapá, Acre e Rondônia' fiscal regions are there more female than male professors.
In [263]:
result[['male','female']].plot.bar(color=sns.color_palette(),figsize=(17,10))
Out[263]:
In [ ]:
#result = dfUfrnComplete[(dfUfrnComplete['emp_responsability']=='PROFESSOR DO MAGISTERIO SUPERIOR')].groupby(['cpf_region','gender_namsor_adjusted'])['cpf_region'].count().unstack('gender_namsor_adjusted').fillna(0)
#result
dfUfrnComplete[(dfUfrnComplete['emp_responsability']=='PROFESSOR DO MAGISTERIO SUPERIOR')]
In [264]:
dfUfrnComplete.columns
Out[264]:
In [279]:
result = dfUfrnComplete[(dfUfrnComplete['responsability_description'].str.len() >3)&((dfUfrnComplete['gender_namsor_adjusted'] == 'male')|(dfUfrnComplete['gender_namsor_adjusted'] == 'female'))].groupby(['responsability_description','gender_namsor_adjusted'])['responsability_description'].count().unstack('gender_namsor_adjusted').fillna(0)
result.plot.bar(figsize=(17,10))
Out[279]:
In [280]:
print(result)
In [302]:
result['f_percentage'] = result['female'] / (result['female'] + result['male']) * 100
result['m_percentage'] = result['male'] / (result['female'] + result['male']) * 100
In [305]:
result[['f_percentage','m_percentage']].plot.bar(figsize=(17,10))
Out[305]:
In [308]:
print(result[['f_percentage', 'm_percentage']].mean())
result[['f_percentage', 'm_percentage']].mean().plot.bar(color=sns.color_palette(),figsize=(17,10))
Out[308]:
In [ ]:
dfUfrnComplete.emp_tot_paycheck = dfUfrnComplete.emp_tot_paycheck.astype(dtype="float")
dfUfrnComplete.emp_tot_paycheck.describe()
In [505]:
dfUfrnComplete.emp_tot_paycheck.copy().sort_values(ascending=False)
Out[505]:
In [506]:
dfUfrnComplete[(dfUfrnComplete['emp_responsability']=='PROFESSOR DO MAGISTERIO SUPERIOR')].emp_tot_paycheck.copy().sort_values(ascending=False)
Out[506]:
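The statement below about the pay ceiling can be checked with a sketch like the following; the ceiling value (R$ 33,763.00, our assumption for the 2017 constitutional ceiling) is not something taken from the crawl:
In [ ]:
# Rough check of how many paychecks exceed the pay ceiling (value assumed for 2017).
CEILING = 33763.00
print(len(dfUfrnComplete[dfUfrnComplete['emp_tot_paycheck'] > CEILING]))
dfUfrnComplete.nlargest(30, 'emp_tot_paycheck')[['name', 'emp_responsability', 'emp_tot_paycheck']]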
In Brazil, the rule is to hold a public competitive examination to select the best available candidate; in some cases there is also the 'Cargo em Comissão', a position of trust whose holder can be freely chosen. In the Federal Government there is no formal distinction based on gender: every employee follows a career-progression path, and technically, if a person follows a given path, there is no possibility of salary differences based on gender. So we did not consider evaluating salary amounts by gender. We did, however, consider that in the public service there are positions of trust that receive a salary bonus. As we could detect in the last section, there is a difference of almost 30%, or to be precise:
Another interesting fact is that at least 1068 people were born in other states, which represents 17.28% of UFRN's employees. "At least" because the local fiscal region is composed of RN and another three states.
We also detected a good number of employees whose pay exceeds the legal ceiling for government employees. Of the top 30 paychecks, only 9 belong to administrative jobs; all the others are professors.
In general, over the 6180 employees analysed, the gender proportion per tool is:
| Tool | Male (%) | Female (%) | Unknown (%) |
|---|---|---|---|
| Namsor | 50.66 | 46.18 | 3.15 |
| (Py)Gender BR | 49.46 | 45.58 | 4.95 |
| Gender Guesser | 44.62 | 34.23 | 21.13 |
As we can see, guessing the gender from a name is highly dependent on the country with which the name is associated. Gender Guesser, the tool for which we did not specify the origin of the names, was not effective for this scenario.